feat(BA-3435): Implement Rolling Update deployment strategy#9997
Conversation
d7a7761 to
10ddb6b
Compare
5939e01 to
d296163
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds a Rolling Update deployment strategy evaluator (pure-function FSM) and refactors DEPLOYING timeout handling to use the coordinator’s standard expired transition mechanism, while also renaming/standardizing deployment sub-step typing and exposing the sub-step via the deployment API.
Changes:
- Implement
RollingUpdateStrategy.evaluate_cycle()with surge/unavailability budgeting and route mutation outputs. - Replace/standardize deployment sub-step handling with
DeploymentLifecycleSubStepacross coordinator/handlers/repos and add skipped-timeout checks that driveexpired → DEPLOYING/ROLLING_BACK. - Add fallback revision-spec loading from endpoint-level fields and surface
sub_stepin deployment DTO/API.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/manager/sokovan/deployment/strategy/test_rolling_update.py | New unit tests covering rolling update FSM outcomes and budgeting. |
| tests/unit/manager/sokovan/deployment/strategy/test_applier.py | Update tests for new sub-step enum and completed detection; adjust fixtures. |
| tests/unit/manager/sokovan/deployment/executor/conftest.py | Extend repo mock to support new get_revision_spec_from_endpoint() path. |
| src/ai/backend/manager/sokovan/deployment/strategy/types.py | Update strategy result types to use DeploymentLifecycleSubStep. |
| src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py | Implement rolling update route classification + create/drain decisions. |
| src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py | Adjust bulk route fetching conditions (incl. terminated new-revision routes). |
| src/ai/backend/manager/sokovan/deployment/strategy/applier.py | Update completed detection to DEPLOYING_COMPLETED; simplify applier surface. |
| src/ai/backend/manager/sokovan/deployment/handlers/deploying.py | Add expired transition for provisioning; refactor rolling-back cleanup to repo. |
| src/ai/backend/manager/sokovan/deployment/handlers/base.py | Docstring alignment to new sub-step naming. |
| src/ai/backend/manager/sokovan/deployment/executor.py | Use endpoint-level revision spec fallback when no current revision exists. |
| src/ai/backend/manager/sokovan/deployment/deployment_controller.py | Update controller API to accept DeploymentLifecycleSubStep. |
| src/ai/backend/manager/sokovan/deployment/coordinator.py | Wire sub-step filtering, add skipped-timeout expiration handling, update task specs. |
| src/ai/backend/manager/services/deployment/service.py | Include sub_step in deployment data conversion; update lifecycle marking callsites. |
| src/ai/backend/manager/repositories/deployment/repository.py | Add sub_steps filtering to handler fetch; add get_revision_spec_from_endpoint(). |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Implement sub-step filtering + endpoint-based revision spec builder query. |
| src/ai/backend/manager/repositories/deployment/creators/deployment.py | Update lifecycle batch updater spec to use DeploymentLifecycleSubStep. |
| src/ai/backend/manager/models/endpoint/row.py | Switch sub-step column type + add build_revision_spec_from_endpoint() helper. |
| src/ai/backend/manager/event_dispatcher/handlers/schedule.py | Decode deployment sub-step using new enum type. |
| src/ai/backend/manager/data/deployment/types.py | Introduce DeploymentLifecycleSubStep; add RouteStatus.is_provisioning(). |
| src/ai/backend/manager/api/rest/deployment/adapter.py | Map sub_step into REST DTO conversion. |
| src/ai/backend/common/dto/manager/deployment/response.py | Add sub_step field to deployment response DTO. |
| proposals/BEP-1049/rolling-update.md | Update BEP doc to match new handler/timeout flow and FSM semantics. |
| proposals/BEP-1049-deployment-strategy-handler.md | Update design doc to reflect 2 DEPLOYING handlers + skipped-timeout expiry behavior. |
| changes/9997.feature.md | Changelog entry for rolling update strategy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| new_healthy=2, old=3 → can_terminate=2, old=3 → 2 (budget-limited) | ||
| new_healthy=4, old=1 → can_terminate=2, old=1 → 1 (old-count-limited) | ||
| """ | ||
| available_count = classified.new_healthy_count + len(classified.old_active) |
There was a problem hiding this comment.
_compute_routes_to_terminate() treats every old_active route as "available" by using len(classified.old_active), but old_active includes PROVISIONING/UNHEALTHY/DEGRADED routes because RouteStatus.is_active() returns true for them. This can overestimate available_count and allow terminating additional old routes beyond the min_available budget (potential downtime). Consider tracking/counting healthy old routes separately (or filtering old_active by status == HEALTHY) when computing available_count and can_terminate, while still keeping the full old_active list for termination ordering.
| available_count = classified.new_healthy_count + len(classified.old_active) | |
| old_healthy_count = sum( | |
| 1 for route in classified.old_active if route.status == RouteStatus.HEALTHY | |
| ) | |
| available_count = classified.new_healthy_count + old_healthy_count |
| # Fetch non-terminated routes + terminated routes belonging to a | ||
| # deploying revision. The FSM needs terminated new-revision routes | ||
| # to count accumulated failures for rollback detection, but old | ||
| # terminated routes are irrelevant and would bloat the result set. | ||
| deploying_revision_ids = { | ||
| deployment.deploying_revision_id | ||
| for deployment in deployments | ||
| if deployment.deploying_revision_id is not None | ||
| } | ||
| route_conditions: list[QueryCondition] = [ | ||
| RouteConditions.by_endpoint_ids(endpoint_ids), | ||
| ] | ||
| if deploying_revision_ids: | ||
| route_conditions.append( | ||
| combine_conditions_or([ | ||
| RouteConditions.exclude_statuses([RouteStatus.TERMINATED]), | ||
| RouteConditions.by_revision_ids(deploying_revision_ids), | ||
| ]) | ||
| ) |
There was a problem hiding this comment.
The comment and query logic here say terminated new-revision routes are needed for "rollback detection", but RollingUpdateStrategy currently never uses new_failed_count for any decision (it only logs it) and the tests/docs state rollback is handled by coordinator timeout. If rollback detection is no longer part of the FSM, consider simplifying the route query back to excluding TERMINATED routes (or updating the comment to reflect the real reason for including terminated routes) to avoid extra result-set bloat and confusion.
| def test_only_failed_new_no_old_rolls_back(self) -> None: | ||
| """Only failed new routes, no old → PROVISIONING (retries creation).""" | ||
| deployment = make_deployment(desired=2) | ||
| spec = RollingUpdateSpec(max_surge=1, max_unavailable=0) | ||
| routes = [ | ||
| make_route(revision_id=NEW_REV, status=RouteStatus.FAILED_TO_START), | ||
| make_route(revision_id=NEW_REV, status=RouteStatus.FAILED_TO_START), | ||
| ] | ||
|
|
||
| result = RollingUpdateStrategy(spec).evaluate_cycle(deployment, routes) | ||
|
|
||
| assert result.sub_step == DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING | ||
|
|
There was a problem hiding this comment.
The test name test_only_failed_new_no_old_rolls_back is misleading: the assertion expects DEPLOYING_PROVISIONING, and the docstring also says it stays in PROVISIONING. Consider renaming the test (and/or updating the docstring) to reflect the actual behavior (retry/wait rather than rollback).
| @pytest.fixture | ||
| def mixed_summary() -> tuple[StrategyEvaluationSummary, UUID, UUID]: | ||
| def rolled_back_summary() -> tuple[StrategyEvaluationSummary, set[UUID]]: | ||
| ep_id = uuid4() | ||
| summary = _build_summary({ep_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING}) | ||
| return summary, {ep_id} | ||
|
|
||
|
|
||
| @pytest.fixture | ||
| def mixed_summary() -> tuple[StrategyEvaluationSummary, UUID, UUID, UUID]: | ||
| provisioning_id = uuid4() | ||
| completed_id = uuid4() | ||
| rolled_back_id = uuid4() | ||
| summary = _build_summary( | ||
| { | ||
| provisioning_id: DeploymentSubStep.PROVISIONING, | ||
| completed_id: DeploymentSubStep.COMPLETED, | ||
| provisioning_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING, | ||
| completed_id: DeploymentLifecycleSubStep.DEPLOYING_COMPLETED, | ||
| rolled_back_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING, | ||
| }, | ||
| route_changes=RouteChanges( | ||
| rollout_specs=[MagicMock()], | ||
| drain_route_ids=[uuid4()], | ||
| ), | ||
| ) | ||
| return summary, provisioning_id, completed_id | ||
| return summary, provisioning_id, completed_id, rolled_back_id |
There was a problem hiding this comment.
rolled_back_summary fixture is both unused in this test module and misleadingly named (it assigns DEPLOYING_PROVISIONING). Consider removing it (and the extra rolled_back_id in mixed_summary if it isn't needed) or renaming it to match the actual sub_step being tested to keep the applier tests focused and clear.
0d5fc37 to
b94732d
Compare
de3fb63 to
89391b8
Compare
…rolling update PR Move non-rolling-update-evaluator changes to the base refactoring PR: - Coordinator: sub_step filtering, expired transition for skipped deployments - Deploying handlers: expired transition, rolling_back post_process - Executor: route creation refactoring - Repository/DB source: sub_steps filter parameter - Strategy applier: remove clear_deploying_revision (moved to repo) - Strategy types: docstring updates - BEP-1049 proposal updates - Test fixtures updates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sses/failures/skipped) Replace the 4-field result (successes, errors, skipped, need_retry) with the 3-field pattern used by session coordinator: successes, failures, skipped. Handlers now report all non-success outcomes as failures (DeploymentExecutionError). The coordinator classifies failures into need_retry/expired/give_up based on retry count and timeout policy, matching the session side's approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resolves #7384 (BA-3435)
Overview
Implements the Rolling Update deployment strategy FSM (BEP-1049) — a pure-function evaluator that gradually replaces old-revision routes with new-revision routes, respecting surge and unavailability budgets.
Also refactors DEPLOYING timeout handling: removes the separate
CHECK_DEPLOYING_TIMEOUTlifecycle type andDeployingTimeoutHandler, replacing them with the standardexpiredtransition mechanism onDeployingProvisioningHandler. Timeout is now checked viaphase_started_atfrom scheduling history (which is not reset on retries due to history merge), eliminating the need for thedeploying_started_atcolumn.Architecture
Cycle-by-Cycle Example (
desired=3, max_surge=1, max_unavailable=1)Safety Guards
max_unavailable < desired, never terminates ALL old routes until at least one new route is healthyRollingUpdateSpecvalidator ensures at least one ofmax_surgeormax_unavailableis positiveDeploying Timeout Refactor
Previously, deploying timeout was handled by a separate
DeployingTimeoutHandlerregistered underCHECK_DEPLOYING_TIMEOUTlifecycle type, running as an independent periodic task. This has been unified with the standardexpiredtransition mechanism:DeployingProvisioningHandlernow declares anexpiredtransition (→ DEPLOYING/ROLLING_BACK)phase_started_atfrom scheduling historyphase_started_atis stable across retries (history records are merged viashould_merge_with, incrementingattemptswithout changingcreated_at)deploying_started_atcolumn and its migration have been removed entirelyKey Types
StrategyCycleResultstrategy/types.pyRouteChangesstrategy/types.pyRollingUpdateSpecmodels/deployment_policy/row.pyAbstractDeploymentStrategystrategy/types.pyRollingUpdateStrategyimplementsChanged Files
strategy/rolling_update.pyhandlers/deploying.pyDeployingTimeoutHandler, addexpiredtransition to provisioning handlercoordinator.pyCHECK_DEPLOYING_TIMEOUTlifecycle, add skipped-timeout check, usephase_started_atuniformlytypes.pyCHECK_DEPLOYING_TIMEOUTenum memberdata/deployment/types.pydeploying_started_atfieldmodels/endpoint/row.pydeploying_started_atcolumntest_rolling_update.pyTest Coverage
TestBasicFSMStatesTestMaxSurgeTestMaxUnavailableTestCombinedSurgeAndUnavailableTestMultiCycleProgressionTestMixedRouteStatusesTestTerminationPriorityTestEdgeCasesTestRouteCreatorSpecsTestRealisticScenarioTestDeadlockAndStallTestDesiredReplicaCountTestScaleDownDuringRollingUpdateTestConcurrentOperationsRelated
DEPLOYINGhandlers #9995 (BA-5067)Checklist: (if applicable)
ai.backend.testdocsdirectory📚 Documentation preview 📚: https://sorna--9997.org.readthedocs.build/en/9997/
📚 Documentation preview 📚: https://sorna-ko--9997.org.readthedocs.build/ko/9997/